R is a tool designed by statisticians, and as such it offers an immense number of options for exploratory data analysis tasks. This section will cover just a few of those methods. In the previous sections, we completed our setup tasks, including loading the required packages and importing our data.
Below, we’ll cover a few examples of functions included in base R to quickly summarize our data. We’ll also leverage package functions to create easy frequency tables to show off some of the flexibility packages afford us.
PATH <- "S:/Pricing/Data/Demand/"
DATA <- read_fst(paste0(PATH,"Closing_a_cw_20181031_by_driver_CTR_QC.fst"))
Assess the dimensions of the data
dim(DATA)
[1] 172489 987
There are a few ways to inspect your data. The simplest is head(), which returns the first six rows by default. If your data is too large to inspect comfortably in the console, you can use View() or click the table in your Environment pane.
head(DATA)[1:15]
# To view the head of the data in the code editor
View(head(DATA))
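Beyond head() and View(), base R offers a few other quick looks at a dataset. A short sketch on the built-in mtcars data (used here so the example is self-contained and runnable anywhere):

```r
# Quick base-R inspection, illustrated on the built-in mtcars data set
str(mtcars)          # compact structure: one line per column, with type and sample values
dim(mtcars)          # number of rows and columns
tail(mtcars, 3)      # last three rows
summary(mtcars$mpg)  # five-number summary plus the mean for a single column
```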
Mean, Min, Max and Quantile by predictor variable
summary(DATA[,c('Clo_Sale_In', 'Dri_Age_Nb', 'Veh_Age_Nb', 'Dri_Yrs_Licensed_AU_Nb')])
Clo_Sale_In Dri_Age_Nb Veh_Age_Nb Dri_Yrs_Licensed_AU_Nb
Min. :0.0000 Min. :16.00 Min. :-1.000 Min. : 0.00
1st Qu.:0.0000 1st Qu.:27.00 1st Qu.: 1.000 1st Qu.: 6.00
Median :0.0000 Median :35.00 Median : 4.000 Median :13.00
Mean :0.2658 Mean :37.75 Mean : 5.173 Mean :16.81
3rd Qu.:1.0000 3rd Qu.:46.00 3rd Qu.: 8.000 3rd Qu.:25.00
Max. :1.0000 Max. :99.00 Max. :96.000 Max. :75.00
Two-or-more-way frequency table
DATA %>%
group_by(Clo_Sale_In,Dri_Gender_Cd,Dri_Marital_Status_In,Clt_Insurer_Tx) %>%
summarise(n = n())
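Since the training data isn't reproduced here, a minimal sketch of the same pattern on invented toy data (all column names hypothetical); note that dplyr's count() is a shorthand for group_by() followed by summarise(n = n()):

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  sale   = c(1, 0, 1, 0, 1),
  gender = c("M", "F", "F", "M", "F")
)

# The pattern used above ...
toy %>%
  group_by(sale, gender) %>%
  summarise(n = n(), .groups = "drop")

# ... is equivalent to the count() shorthand
toy %>% count(sale, gender)
```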
How to quickly spot data errors?
A well-known data issue involves the multiline discount and the client home main location variables: if the indicator shows a multiline discount, then we should also see a client main location for that client.
Here is a quick way to spot this data issue:
DATA %>%
filter(Quo_Type_Tx == "Inbound") %>%
group_by(Rat_DISML_In,Clt_Home_Main_Location_Cd) %>%
summarise(n = n())
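On toy data (all names invented), the same pattern isolates the inconsistent rows, i.e. a discount flag of "Y" paired with a missing location:

```r
library(dplyr)

# Hypothetical toy data: the discount flag should imply a non-missing location
toy <- data.frame(
  multiline = c("Y", "Y", "N", "N"),
  home_loc  = c("A1", NA,  NA,  "B2")
)

# Cross-tabulate the two variables; any row with multiline == "Y"
# and home_loc == NA is a data error of the kind described above
toy %>%
  group_by(multiline, home_loc) %>%
  summarise(n = n(), .groups = "drop")

# Or pull the offending rows directly
toy %>% filter(multiline == "Y" & is.na(home_loc))
```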
Before diving into an example using a real work situation, let’s cover a note on memory usage.
It is important to understand what happens in the background when using R, and users should be aware of how much memory is being used. R is an in-memory system: all data is held in RAM rather than on disk, unlike some other statistical software. This means it's important to manage your memory usage in order to avoid running into memory limitations.
mem_used()
953 MB
object_size(DATA)
863 MB
Example: we will copy the data object to another object called x just for illustration purposes. The mem_change() function measures the change in memory from running a piece of code. Because R does not copy an object until it is modified (copy-on-modify), creating the new object x increases the memory used by only 376 B, not the full 863 MB.
mem_change(x <- DATA)
376 B
Then, by removing the data x, we get 320 B of memory back.
mem_change(rm(x))
320 B
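The memory functions used above (mem_used(), object_size(), mem_change()) come from the pryr package. A small self-contained sketch of copy-on-modify, the mechanism behind the tiny 376 B change:

```r
library(pryr)  # provides mem_used(), object_size(), mem_change()

x <- runif(1e6)
object_size(x)          # roughly 8 MB (one million doubles)

mem_change(y <- x)      # only a few hundred bytes: y points to the same data
mem_change(y[1] <- 0)   # roughly 8 MB: modifying y finally forces the copy
mem_change(rm(x, y))    # memory is released once both names are gone
```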
Let’s say the manager of your team wants you to look at the profitability of new business (quotes that closed).
The first thing we will do is score our loss cost model on our data. Currently, to score LC models for Quebec in R, we use the RUO package.
The first step of this process is to load the file 00 - Functions, which contains the data preparation steps specific to Quebec PPA data. Then, we load our different loss cost models (by coverage), which we have in Excel. The 1 and 2 in the model names refer to the “NIC” and “NonNIC” portions of the loss cost.
library(RUO)
source("S:/Pricing/Innovation/R Ressources/R - Training 2.0/Training/00 - Functions.R")
PATH_RUO <- 'S:/Pricing/Innovation/R Ressources/R - Training 2.0/Training/RUO_LC_QC_PPA_V2.0/'
LC_MODEL <- EXCEL.to.R(paste0(PATH_RUO,"LC_QC_PPA_V2.0_201706.xls"))
# These variables will be used later. They are parameters needed by RUO
MODEL_NAMES <- c("TPL1","TPL2","COLL1","COLL2","COMP1","COMP2")
Uniq_Keys <- c("Clt_Insurer_No", "Clt_Account_No", "Pol_Policy_No", "Veh_Id_No", "Dri_Id_No")
Prior to scoring, some modifications to the data need to be made. This in-house function makes sure the data and model match up properly. You can open the function to see what changes are being made to the data.
DATA <- data_modifications(DATA = DATA)
The final step is to use the APPLY.MOD or APPLY.MOD.lm function, which lets you specify your link function, the name of your model and the name of your data to score it properly.
APPLY.MOD() - returns the model predictions as well as the banded variables and relativities of each variable, and is generally used for validation purposes.
DATA <- as.data.frame(DATA)
DATA_RELS <- APPLY.MOD(RMOD = LC_MODEL,
link_Fnc = "log",
Uniq_Keys = Uniq_Keys,
DATA = DATA,
Split = F,
MODEL_NAME = MODEL_NAMES)
The output below shows the relativity variables created. You could do the same exercise by replacing _Rel with _Bnd to see the banded variables.
head(names(DATA_RELS)[grep('_Rel', names(DATA_RELS))])
[1] "TPL1_Clt_Credit_Score_No_Rel" "TPL1_Clt_Previous_Ins_In_Rel"
[3] "TPL1_Clt_Yrs_Client_Since_Nb_Rel" "TPL1_Dri_Age_Nb_Rel"
[5] "TPL1_Dri_Gender_Cd_Rel" "TPL1_Dri_Type_Cd_Rel"
APPLY.MOD.lm() - can return the final score only, or the relativities as well if we want them; it is much faster than APPLY.MOD() because it can run in parallel using C++ code.
DATA_All <- APPLY.MOD.lm(RMOD = LC_MODEL,
link_Fnc = "log",
Uniq_Keys = Uniq_Keys,
DATA = DATA,
MODEL_NAME = MODEL_NAMES,
KEEP_DATA = T,
KEEP_RELS = F)
# Calculate the total LC
DATA_Scored <- DATA_All %>%
mutate(TPL_PRED = round(TPL1_PRED*TPL2_PRED),
COLL_PRED = round(COLL1_PRED*COLL2_PRED),
COMP_PRED = round(COMP1_PRED*COMP2_PRED),
LC_TOTAL = TPL_PRED + COLL_PRED + COMP_PRED)
The Scoring Object System initiative aims to replace Radar with R scripts. The RUO package is the main component of this project: it takes Radar exports, such as the Bandings and Multivariate tables, and converts them into an R-usable format. We then created scripts to score this new object (the Scoring Object) in R. The block of code above gives an example of this in action.
Let’s now use our scored object to perform a rebase. We will also use this opportunity to break down the more common functions used in the dplyr package.
First, we must create a data frame with the indicated loss cost by vehicle and by company. Usually, we would also break the exercise down by coverage, but for simplicity we rebase only by insurer here.
DATA_Indicated = data.frame( Clt_Insurer_Tx = c("PIC", "SN", "TDHA"),
Indicated_Total_Avg_Am = c(1160, 1112, 860))
We still need the loss costs to be equal at the aggregate level. To put the current model on the same average level as indicated, we must calculate the rebasing factor.
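In plain numbers, the rebasing factor is just the ratio of the indicated average to the current modelled average; a toy illustration with invented figures:

```r
# Hypothetical figures: the current model averages 1000 per vehicle,
# but the indication says the average should be 1160
indicated_avg <- 1160
current_avg   <- 1000
rebase_factor <- indicated_avg / current_avg  # 1.16

# Scaling every score by this factor puts the book on the indicated level
rebased <- c(800, 1200, 1000) * rebase_factor
mean(rebased)  # 1160
```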
The code below can seem intimidating, so we will give a more in-depth breakdown of the main dplyr functions used. If you are familiar with SAS, you will recognize analogues of many of them:
select: Chooses the variables (columns) you need
group_by: Groups the data along one or more characteristics; frequently used together with summarise() or slice()
summarise: Calculates aggregated variables for each group, such as the average, sum or standard deviation
filter: Keeps only the observations that meet certain criteria, subsetting your data
left_join: Joins two tables together on one or more variables, which act as a key
mutate: Adds new variables to your table; you can compare it with the DATA step in SAS
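A minimal sketch chaining these six verbs together on invented toy data (all names hypothetical), in the same pipe style used throughout this section:

```r
library(dplyr)

# Hypothetical toy tables for illustration
quotes <- data.frame(
  insurer = c("PIC", "PIC", "SN", "SN"),
  premium = c(900, 1100, 800, 1000),
  closed  = c(1, 0, 1, 1)
)
rates <- data.frame(insurer = c("PIC", "SN"), factor = c(1.05, 0.98))

quotes %>%
  filter(closed == 1) %>%                     # subset: closed quotes only
  select(insurer, premium) %>%                # keep the columns we need
  left_join(rates, by = "insurer") %>%        # attach a factor by key
  mutate(premium_adj = premium * factor) %>%  # create a new column
  group_by(insurer) %>%                       # one group per insurer
  summarise(avg_adj = mean(premium_adj))      # aggregate per group
# insurer  avg_adj
# PIC          945
# SN           882
```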
We’ll first break down the calculation of the rebase factor step by step, then show the full calculation in a single block underneath.
The first step calculates the loss cost for each vehicle: we sum the loss cost by vehicle using the group_by function.
Rebase_Factor_1 <- DATA_Scored %>%
group_by(Clt_Insurer_Tx,
Clt_Account_No,
Pol_Policy_No,
Veh_Id_No) %>%
summarise(LC_Veh_Total = sum(LC_TOTAL))
We then group by company to find the average loss cost for each one. na.rm is an option that excludes NA values from the mean calculation (not from the data); if there is an NA value and we omit this option, mean will return NA.
Rebase_Factor_2 <- Rebase_Factor_1 %>%
group_by(Clt_Insurer_Tx) %>%
summarise(Exposure = sum(n()),
LC_Veh_Total_Avg = mean(LC_Veh_Total, na.rm = TRUE))
Finally, we join these vehicle loss cost averages to the indicated total amounts to calculate the rebasing factor. Because TDGI should not be in our data (it has only one exposure), we remove it.
Rebase_Factor_3 <- Rebase_Factor_2 %>%
left_join(DATA_Indicated) %>%
filter(Clt_Insurer_Tx != "TDGI") %>%
mutate(Rebase_Factor = Indicated_Total_Avg_Am/LC_Veh_Total_Avg)
All of this can be combined into the single pipeline below.
Rebase_Factor <- DATA_Scored %>%
group_by(Clt_Insurer_Tx,
Clt_Account_No,
Pol_Policy_No,
Veh_Id_No) %>%
summarise(LC_Veh_Total = sum(LC_TOTAL)) %>%
group_by(Clt_Insurer_Tx) %>%
summarise(Exposure = sum(n()),
LC_Veh_Total_Avg = mean(LC_Veh_Total, na.rm=TRUE)) %>%
left_join(DATA_Indicated) %>%
filter(Clt_Insurer_Tx != "TDGI") %>%
mutate(Rebase_Factor = Indicated_Total_Avg_Am/LC_Veh_Total_Avg)
Once the rebasing factor is calculated, we can rebase the scored total loss cost so that each company has the same average as its indicated loss cost.
Apply rebasing factor
DATA_Scored <- DATA_Scored %>%
left_join(Rebase_Factor) %>%
mutate(LC_TOTAL_Rebased = LC_TOTAL * Rebase_Factor)
Before moving on to plotting, let’s conclude our example with a profitability analysis, first looking at the expected profitability of the quotes that closed, by company.
AGG.VEH <- DATA_Scored %>%
group_by(Clt_Insurer_Tx,
Clt_Account_No,
Pol_Policy_No,
Veh_Id_No) %>%
summarise(LC_Veh_Total = sum(LC_TOTAL))
AGG.INSURER <- DATA_Scored %>%
left_join(AGG.VEH) %>%
filter(Clo_Sale_In == 1 & Dri_Type_Cd == 'P') %>% # To have the data by vehicle
select(LC_Veh_Total,
Prm_Trm_Veh_Tot_Am,
Clt_Insurer_Tx) %>%
group_by(Clt_Insurer_Tx) %>%
summarise(Prm_Trm_Veh_Mean = mean(Prm_Trm_Veh_Tot_Am),
LR_Veh_Mean = sum(LC_Veh_Total, na.rm = TRUE)/sum(Prm_Trm_Veh_Tot_Am),
Xpo_Veh_Nb = sum(n()))
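The loss-ratio line above is simply the expected loss cost over the premium, with NAs excluded from the numerator; a toy check with invented numbers:

```r
# Hypothetical per-vehicle values
lc      <- c(300, 500, NA)     # expected loss cost (one missing)
premium <- c(1000, 1100, 900)  # term premium

sum(lc, na.rm = TRUE) / sum(premium)  # 800 / 3000 = 0.2667
```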
Then we can calculate the expected profitability of the quotes that closed, by marital status and driver age.
AGG.DRI_AGE <- DATA_Scored %>%
left_join(AGG.VEH) %>%
filter(Clo_Sale_In == 1 & Dri_Type_Cd == 'P') %>% # To have the data by vehicle
select(LC_Veh_Total,
Prm_Trm_Veh_Tot_Am,
Dri_Age_Nb,
Dri_Marital_Status_In) %>%
group_by(Dri_Age_Nb,
Dri_Marital_Status_In) %>%
summarise(Prm_Trm_Veh_Mean = mean(Prm_Trm_Veh_Tot_Am),
LR_Veh_Mean = sum(LC_Veh_Total, na.rm = TRUE)/sum(Prm_Trm_Veh_Tot_Am),
Xpo_Veh_Nb = sum(n()))
And finally, we can calculate the expected profitability of the quotes that closed, by the gender of the principal driver and their number of years licensed.
AGG.YRS_LIC <- DATA_Scored %>%
left_join(AGG.VEH) %>%
filter(Clo_Sale_In == 1 & Dri_Type_Cd == 'P') %>% # To have the data by vehicle
select(LC_Veh_Total,
Prm_Trm_Veh_Tot_Am,
Dri_Gender_Cd,
Dri_Yrs_Licensed_AU_Nb) %>%
group_by(Dri_Gender_Cd,
Dri_Yrs_Licensed_AU_Nb) %>%
summarise(Prm_Trm_Veh_Mean = mean(Prm_Trm_Veh_Tot_Am),
LR_Veh_Mean = sum(LC_Veh_Total, na.rm = TRUE)/sum(Prm_Trm_Veh_Tot_Am),
Xpo_Veh_Nb = sum(n()))
Finally, we will cover R’s plotting functionality before moving onto our functions lesson. R has two main packages which offer extremely powerful options for plotting: ggplot2 and plotly. Each have their own strengths and weaknesses, but ultimately they’re both powerful enough that your decision can come down to personal preference. In this lesson, we’ll use plotly, but the reader is encouraged to look at ggplot2 if they are interested.
Note that ggplot2 does not support a fully independent second y-axis; when one is needed, plotly is the better choice.
Loss Ratio by Insurer
The goal of this graph is to have both a line graph and a bar graph within the same plot. This allows the user to visualize multiple aspects of the data at once. In this case, the bar graph represents the exposure for each insurer, while the line graph overlaying it represents the average loss ratio.
For building plots, we can use a step-by-step approach similar to dplyr’s:
P_LR_Insurer <- plot_ly(data = AGG.INSURER) %>%
add_lines(
x = ~Clt_Insurer_Tx,
y = ~LR_Veh_Mean,
name = "Loss Ratio",
yaxis = "y"
)
The pipe syntax allows more and more graphical items to be layered on; all that is needed is to know the correct grammar.
P_LR_Insurer2 <- plot_ly(data = AGG.INSURER) %>%
add_lines(x = ~Clt_Insurer_Tx, y = ~LR_Veh_Mean,
name = "Loss Ratio",
yaxis = "y"
) %>%
add_bars(x = ~Clt_Insurer_Tx, y = ~Xpo_Veh_Nb,
name = "Exposure",
yaxis = "y2"
) %>%
layout(
title = "Loss Ratio by Insurer",
xaxis = list(title = "Insurer"),
yaxis = list(title = "Loss Ratio", side = "left", overlaying = "y2"),
yaxis2 = list(title = "Exposure", side = "right")
)
Rescale y-axis
P_LR_Insurer3 <- P_LR_Insurer2 %>%
layout(
yaxis = list(
range = c(0, 2)
)
)
Marital x Driver Age
AGG.DRI_AGE <- ungroup(AGG.DRI_AGE)
P_LR_Marital_Age <- plot_ly(data = AGG.DRI_AGE, color =~Dri_Marital_Status_In, colors = c("red","darkgreen")) %>%
add_bars(x=~Dri_Age_Nb, y=~Xpo_Veh_Nb, yaxis="y2", opacity = .5)%>%
add_lines(x=~Dri_Age_Nb, y=~LR_Veh_Mean) %>%
layout(
barmode="stack",
title = "Loss Ratio by Driver Age and Marital Status",
xaxis = list(title = "Driver Age", range = c(16, 70)),
yaxis = list(title = "Loss Ratio", side = "left", overlaying = "y2"),
yaxis2 = list(title = "Exposure", side = "right")
)
Gender x Yrs Licensed
P_LR_Gender_YL <- plot_ly(data = AGG.YRS_LIC, color =~Dri_Gender_Cd, colors = c("red","darkgreen")) %>%
add_bars(x=~Dri_Yrs_Licensed_AU_Nb, y=~Xpo_Veh_Nb, yaxis="y2")%>%
add_lines(x=~Dri_Yrs_Licensed_AU_Nb, y=~LR_Veh_Mean) %>%
layout(
barmode="group",
title = "Loss Ratio by Gender and Years Licensed",
xaxis = list(title = "Years Licensed", range = c(-1, 60)),
yaxis = list(title = "Loss Ratio", side = "left", overlaying = "y2"),
yaxis2 = list(title = "Exposure", side = "right")
)
We can clearly see that graphs of the same type have a very similar code structure, allowing users to create new graphs easily.
We can now save these graphs to HTML files for later analysis. Do not forget to change the path to your own directory. (The dirname()/basename() construction below is a common workaround that rebuilds a clean absolute path for saveWidget().)
path <-"S:/Pricing/Innovation/R Ressources/R - Training 2.0/Training/"
saveWidget(P_LR_Insurer,file.path(normalizePath(dirname(paste0(path, "Clt_Insurer_Tx", ".html"))),basename(paste0(path, "Clt_Insurer_Tx", ".html"))))
saveWidget(P_LR_Gender_YL,file.path(normalizePath(dirname(paste0(path, "Gender_x_YL", ".html"))),basename(paste0(path, "Gender_x_YL", ".html"))))